This document presents a complete Species Distribution Model (SDM) for the American Lobster (Homarus americanus) in the Gulf of Maine region. The workflow follows the mind-map framework provided in the course, progressing through six key stages: setup, observations, background, covariates, models, and prediction.
The American Lobster is an iconic crustacean species of significant ecological and economic importance in the Northwest Atlantic. It supports one of the most valuable fisheries in New England and Atlantic Canada. Understanding how climate change may shift its habitat is critical for fisheries management and conservation planning.
Before building my SDM, I need to understand the spatial data structures I’ll be working with. This includes point data (buoy locations, species observations) and raster data (environmental variables from the Brickman model).
The plot below shows my study region with NOAA buoy locations that monitor oceanographic conditions:
Gulf of Maine study region showing NOAA monitoring buoys.
Interpretation: The buoys are distributed throughout the Gulf of Maine and provide real-time oceanographic measurements. This region spans from approximately 39°N to 46°N latitude and 64°W to 76°W longitude.
I use the Brickman oceanographic model, which provides downscaled climate projections for the Northwest Atlantic. The model includes variables such as sea surface temperature (SST), salinity, and bottom conditions across different climate scenarios.
Monthly sea surface temperature (SST) from Brickman model for present conditions.
Interpretation: SST shows strong seasonal variation in the Gulf of Maine, ranging from cold winter temperatures (< 5°C) to warm summer conditions (> 20°C in coastal areas). This seasonality is crucial for understanding lobster habitat preferences throughout the year.
Species occurrence data is obtained from the Ocean Biodiversity Information System (OBIS), a global open-access repository for marine species observations. The fetch_obis() function downloads records, and read_obis() loads them for analysis.
Starting dataset: I begin with 209,167 American Lobster occurrence records from OBIS.
Raw occurrence data often contains issues that must be addressed before modeling:
Distribution of American Lobster observations by year.
Interpretation: The histogram shows observation effort has increased dramatically since the 1990s. The dashed red lines mark the Brickman climatology period. I retain records from 1970 onwards to capture sufficient historical data while maintaining relevance to my environmental predictors.
Distribution of American Lobster observations by month.
Interpretation: Observation effort varies by month, with more records during warmer months when field surveys are more feasible. This sampling bias will be accounted for in my background point generation.
I filter observations to include only those within the Brickman model domain (ocean areas with valid environmental data):
American Lobster observations overlaid on the Brickman ocean mask.
Interpretation: Points shown are American Lobster occurrences. Any observations falling on land or outside the model domain are removed. The Brickman mask ensures I only model areas where environmental predictions are available.
| Metric | Count |
|---|---|
| Starting observations | 209,167 |
| Final observations | 104,630 |
| Records removed | 104,537 |
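The filtering logic summarised in this table can be sketched in base R. This is a minimal illustration on made-up records, not the course's own functions: the column names (year, lon, lat), the 1-degree toy mask, and the helper in_domain() are all hypothetical stand-ins for the OBIS table and the Brickman mask.

```r
# Toy occurrence records (hypothetical values, for illustration only)
obs <- data.frame(
  year = c(1965, 1982, 1999, 2010, 2015),
  lon  = c(-70.2, -69.5, -68.1, -75.0, -67.3),
  lat  = c(42.1, 43.0, 44.2, 47.5, 44.8)
)

# Toy ocean mask on a 1-degree grid: TRUE = ocean cell with valid predictors
lon_edges <- seq(-76, -64, by = 1)
lat_edges <- seq(39, 46, by = 1)
mask <- matrix(TRUE, nrow = length(lon_edges) - 1, ncol = length(lat_edges) - 1)
mask[1:3, ] <- FALSE  # pretend the three westernmost columns are land

# Hypothetical helper: is each point inside the grid and on an ocean cell?
in_domain <- function(lon, lat) {
  i <- findInterval(lon, lon_edges)
  j <- findInterval(lat, lat_edges)
  inside <- i >= 1 & i <= nrow(mask) & j >= 1 & j <= ncol(mask)
  ok <- rep(FALSE, length(lon))
  ok[inside] <- mask[cbind(i[inside], j[inside])]
  ok
}

# Keep records from 1970 onwards that fall on a valid ocean cell
keep <- obs$year >= 1970 & in_domain(obs$lon, obs$lat)
filtered <- obs[keep, ]

nrow(obs) - nrow(filtered)  # records removed
```

In this toy case one record fails the year filter and one falls outside the grid, so three of the five records remain.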
Species distribution models require both presence data (where the species was observed) and pseudo-absence or background data (locations representing available habitat). Background points characterize the environmental conditions across the study area, allowing models to distinguish habitat preferences.
Spatial distribution of American Lobster observations by month.
Interpretation: American Lobster observations are concentrated in coastal areas, particularly around Massachusetts, Maine, and the Bay of Fundy. Observation counts vary by month, reflecting both species behavior and sampling effort. Many grid cells contain multiple overlapping observations.
To reduce spatial autocorrelation, I thin observations so that only one record per Brickman grid cell is retained per month. This prevents overweighting of heavily sampled areas.
Spatially thinned observations (one per grid cell per month).
Interpretation: After thinning, the observation counts are significantly reduced (compare with raw counts above). The spatial pattern is preserved, but each grid cell contributes only once per month, reducing pseudoreplication.
| Dataset | Total Records |
|---|---|
| Original observations | 104,630 |
| After spatial thinning | 5,728 |
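The idea behind one-record-per-cell-per-month thinning can be shown with a few lines of base R. This is a sketch under assumptions: the cell column stands for a Brickman grid-cell index that the real workflow would derive from coordinates, and the data are invented. It is not the implementation of thin_by_cell().

```r
# Toy observations: cell is a grid-cell index, month is 1-12 (hypothetical data)
obs <- data.frame(
  cell  = c(101, 101, 101, 102, 102, 103),
  month = c(  6,   6,   7,   6,   6,   6)
)

# Keep the first record for each (cell, month) combination
thinned <- obs[!duplicated(obs[, c("cell", "month")]), ]

nrow(obs)      # 6 records before thinning
nrow(thinned)  # 4 records after thinning
```

Cell 101 keeps two records because its observations fall in two different months, which matches the per-month thinning described above.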
I create a bias map based on observation density. Areas with more observations are weighted higher when sampling background points, which accounts for non-random sampling effort.
Sampling bias map based on observation density.
Interpretation: The bias map highlights coastal areas (especially around Massachusetts and Maine) where observation effort is highest. By using bias-weighted background sampling, I ensure that model training accounts for this uneven sampling.
I sample background points using biased sampling, with the number of background points per month matching the average observation count:
Background points per month: 8,719
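A minimal sketch of bias-weighted background sampling, assuming the bias map can be reduced to an observation count per ocean cell. The cell ids and counts below are invented, and the real sample_background() may work differently. The per-month target is consistent with dividing the 104,630 filtered records evenly over 12 months.

```r
set.seed(42)  # reproducible draw

# Toy bias map: observation counts per ocean cell (hypothetical values)
cell_id   <- 1:6
obs_count <- c(50, 30, 10, 5, 4, 1)

# Sampling weight proportional to observation density
weight <- obs_count / sum(obs_count)

# Average monthly observation count: 104,630 records over 12 months
n_background <- round(104630 / 12)  # 8719

# Draw background cells with replacement, favouring heavily sampled cells
bg_cells <- sample(cell_id, size = n_background, replace = TRUE, prob = weight)

# Share of background points per cell, to compare against the weights
round(table(factor(bg_cells, levels = cell_id)) / n_background, 2)
```

Because background points share the same spatial bias as the presences, the model is less likely to mistake sampling effort for habitat preference.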
Presence points (thinned observations) and background points by month.
Interpretation: Red points are presence locations (thinned observations) and blue points are background (pseudo-absence) locations. Background points are distributed across the study area following the bias map weighting, ensuring representation of available habitat conditions.
The Brickman model provides multiple environmental predictors. However, using highly correlated variables can cause multicollinearity issues in models. I assess pairwise correlations to select an appropriate subset.
Pairs plot showing correlations between Brickman environmental variables.
Interpretation: The pairs plot reveals strong correlations between some variables (e.g., SST and Tbtm). To avoid multicollinearity, I use automated filtering with a correlation threshold of 0.65.
Variables selected for modeling:
| Retained | Removed (collinear) |
|---|---|
| depth, month, SSS, U, Sbtm, V, Tbtm, MLD, SST | Xbtm |
I always include depth and month as ecologically important predictors for marine species.
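One simple way to filter on a correlation threshold is a greedy pass over the variables. The sketch below is an assumption about how such a filter can work, not a description of filter_collinear(); the helper greedy_filter() and the synthetic variables x1, x2, x3 are hypothetical.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
env <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.1),  # nearly a copy of x1, so highly correlated
  x3 = rnorm(n)                  # generated independently of x1 and x2
)

# Hypothetical helper: walk the variables in order, keeping each one only if
# its absolute correlation with every already-kept variable is below cutoff
greedy_filter <- function(df, cutoff = 0.65) {
  cm <- abs(cor(df))
  keep <- character(0)
  for (v in colnames(cm)) {
    if (all(cm[v, keep] < cutoff)) keep <- c(keep, v)
  }
  keep
}

greedy_filter(env)  # x2 is dropped because it duplicates x1
```

A greedy filter depends on variable order: whichever member of a correlated pair comes first is the one retained, so ecologically preferred variables should be listed first.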
This plot compares the environmental conditions at presence locations versus background locations:
Comparison of environmental conditions between presence and background points.
Interpretation: The density plots show how environmental conditions differ between presence (lobster locations) and background (available habitat). Variables where the two distributions differ substantially are likely important predictors. For example, I might observe that lobsters prefer specific depth ranges or temperature conditions.
Configuration saved with 9 predictor variables for model training.
Following the course workflow, I train four different machine learning algorithms and compare their performance: logistic regression (GLM), random forest, boosted trees, and MaxEnt.
I apply log-transformation to skewed variables (depth, and Xbtm whenever it is retained; here it was removed as collinear) and convert month to numeric for modeling.
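The two transformations can be illustrated in base R. This is a sketch with invented values; the log base (10) and the month encoding as a factor of abbreviated names are assumptions, since neither is stated here.

```r
# Toy covariates (hypothetical values)
covs <- data.frame(
  depth = c(5, 50, 500, 2000),  # metres, strongly right-skewed
  month = factor(c("Jan", "Apr", "Jul", "Oct"), levels = month.abb)
)

covs$depth <- log10(covs$depth)       # compress the long right tail
covs$month <- as.integer(covs$month)  # level position: Jan = 1, ..., Dec = 12

covs
```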
To evaluate model performance on independent data, I create a spatial block split. This ensures training and testing data are geographically separated, providing a more realistic assessment of model transferability.
Spatial block split showing training (blue) and testing (red) data.
Interpretation: The spatial blocking ensures that nearby points are either all in training or all in testing, preventing spatial autocorrelation from inflating accuracy estimates.
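The principle of a spatial block split is that whole blocks, not individual points, are assigned to training or testing. The sketch below uses random points, a 2-degree block size, and a 20% hold-out; all three are assumptions chosen for illustration, not the settings used in this analysis.

```r
set.seed(7)

# Toy points scattered over the study region (hypothetical data)
pts <- data.frame(
  lon = runif(200, -76, -64),
  lat = runif(200, 39, 46)
)

# Assign each point to a square block (block size is an assumption)
block_size <- 2
pts$block <- paste(floor(pts$lon / block_size),
                   floor(pts$lat / block_size), sep = "_")

# Hold out roughly 20% of blocks, not 20% of points
blocks <- unique(pts$block)
test_blocks <- sample(blocks, size = ceiling(0.2 * length(blocks)))
pts$set <- ifelse(pts$block %in% test_blocks, "testing", "training")

table(pts$set)
```

Because the split is made at block level, no block contributes points to both sets.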
Within the training data, I use 5-fold spatial cross-validation for hyperparameter tuning:
Five-fold spatial cross-validation structure.
Interpretation: Each color represents a different fold. During tuning, each fold takes a turn as the validation set while the others serve as training data.
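The fold rotation can be sketched as follows, assuming the training area has already been divided into spatial blocks. The 20 block labels are hypothetical, and the real workflow would build its folds with a resampling package instead of by hand.

```r
set.seed(3)

# Hypothetical spatial blocks inside the training area
blocks <- paste0("block", 1:20)

# Assign blocks to five folds of equal size, in random order
fold <- sample(rep(1:5, length.out = length(blocks)))

# Each fold serves once as the validation set
for (k in 1:5) {
  validation <- blocks[fold == k]
  training   <- blocks[fold != k]
  cat("fold", k, ":", length(training), "training blocks,",
      length(validation), "validation blocks\n")
}
```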
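```r
library(tidymodels)
library(tidysdm)  # assumed source of maxent() and the maxnet engine; the course setup.R may load it

# A single row is enough for the recipe to learn column names and types.
# tr_data is the training portion of the spatial split created earlier.
one_row_of_training_data = dplyr::slice(tr_data, 1)
rec = recipe(one_row_of_training_data, formula = class ~ .)

# One preprocessing recipe crossed with four model specifications;
# tune() marks hyperparameters to be chosen by cross-validation.
wflow = workflow_set(
  preproc = list(default = rec),
  models = list(
    glm = logistic_reg(mode = "classification") |> set_engine("glm"),
    rf = rand_forest(mtry = tune(), trees = tune(), mode = "classification") |>
      set_engine("ranger", importance = "impurity"),
    btree = boost_tree(mtry = tune(), trees = tune(), tree_depth = tune(),
                       learn_rate = tune(), loss_reduction = tune(),
                       stop_iter = tune(), mode = "classification") |>
      set_engine("xgboost"),
    maxent = maxent(feature_classes = tune(), regularization_multiplier = tune(),
                    mode = "classification") |> set_engine("maxnet")
  )
)
```

Hyperparameter tuning results for each model.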
Interpretation: This plot shows how different hyperparameter combinations affect model accuracy during cross-validation. Higher values indicate better performance.
| wflow_id | accuracy | boyce_cont | roc_auc | tss_max |
|---|---|---|---|---|
| default_glm | 0.8687392 | 0.4265954 | 0.6231085 | 0.2011549 |
| default_rf | 0.8634139 | 0.2951411 | 0.8119974 | 0.5036906 |
| default_btree | 0.8687392 | 0.5783140 | 0.7215759 | 0.3574409 |
| default_maxent | 0.5889465 | 0.9755823 | 0.7539606 | 0.3904795 |
Confusion matrices showing classification performance for each model.
Interpretation: The confusion matrices show true positives, true negatives, false positives, and false negatives for each model. Better models have higher values on the diagonal (correct predictions) and lower values off-diagonal (errors).
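The metrics in the comparison table are all derived from counts like these. The sketch below computes accuracy and the True Skill Statistic (TSS) from one confusion matrix at a single threshold. The counts are invented and are not taken from the fitted models; tss_max in the table is assumed to be the maximum TSS over all thresholds.

```r
# Toy confusion-matrix counts (hypothetical, not taken from the models above)
tp <- 40;  fn <- 60   # presences predicted correctly / missed
tn <- 850; fp <- 50   # background predicted correctly / wrongly flagged

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)  # true positive rate
specificity <- tn / (tn + fp)  # true negative rate
tss         <- sensitivity + specificity - 1

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, tss = tss), 3)
```

When background points far outnumber presences, accuracy can be high even though most presences are missed, which is why accuracy is best read alongside TSS and ROC AUC.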
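With trained models, I can now predict habitat suitability across the study area under current and future climate conditions. I use the Boosted Tree model for predictions: it tied with the GLM for the highest accuracy in the comparison table (0.869), although the random forest scored higher on ROC AUC (0.812) and maximum TSS (0.504).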
Habitat suitability prediction for American Lobster under present conditions.
Interpretation: This map shows the probability of American Lobster occurrence under current environmental conditions. Warmer colors (yellow/orange) indicate higher habitat suitability. The species shows strong preference for coastal shelf areas, with seasonal variation visible across months.
I generate predictions under two Representative Concentration Pathways (RCPs): RCP 4.5, an intermediate-emissions scenario, and RCP 8.5, a high-emissions scenario. Each scenario is projected for years 2055 and 2075.
Habitat suitability under RCP 4.5 climate scenario, year 2055.
Habitat suitability under RCP 4.5 climate scenario, year 2075.
Habitat suitability under RCP 8.5 climate scenario, year 2055.
Habitat suitability under RCP 8.5 climate scenario, year 2075.
Comparison of habitat suitability predictions across all climate scenarios.
Interpretation: Comparing across scenarios reveals how climate change may shift American Lobster habitat:
These predictions can inform fisheries management and conservation planning for this economically important species.
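A scenario comparison can be reduced to a cell-by-cell difference between a future and a present suitability grid. The sketch below uses two tiny invented matrices in place of the predicted rasters, so the numbers carry no ecological meaning.

```r
# Toy suitability grids on a 2 x 3 grid (hypothetical probabilities)
present    <- matrix(c(0.8, 0.6, 0.4, 0.3, 0.2, 0.1), nrow = 2)
rcp85_2075 <- matrix(c(0.5, 0.5, 0.5, 0.4, 0.4, 0.3), nrow = 2)

# Positive values = projected gain in suitability, negative = loss
change <- rcp85_2075 - present

mean(change)     # average change across cells
sum(change < 0)  # number of cells losing suitability
```

Mapping the change grid, not just the two suitability maps side by side, makes gains and losses easier to see.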
All predictions saved to disk for future analysis.
This project followed the complete Species Distribution Modeling workflow as outlined in the course mind-map:
| Chapter | Stage | Key Functions | Output |
|---|---|---|---|
| C00 | Setup | source("setup.R") | Loaded packages and spatial data |
| C01 | Observations | fetch_obis() → read_obis() | Filtered occurrence dataset |
| C02 | Background | thin_by_cell() → sample_background() | Presence + background points |
| C03 | Covariates | filter_collinear() → extract_brickman() | Environmental predictors |
| C04 | Models | workflow_set() → workflow_map() → workflowset_selectomatic() | Trained model fits |
| C05 | Prediction | predict_stars() | Habitat suitability maps |
Data Quality: Starting with 209,167 records, quality filtering retained 104,630 observations for modeling.
Environmental Predictors: After collinearity filtering, 9 variables were retained: depth, month, SSS, U, Sbtm, V, Tbtm, MLD, SST.
Model Performance: Four algorithms were tuned with 5-fold spatial cross-validation and evaluated on a spatially separated test set, which guards against overly optimistic accuracy estimates.
Climate Projections: Habitat suitability predictions suggest potential range shifts under future climate scenarios, with more severe changes under RCP 8.5.
The American Lobster is a keystone species for the Gulf of Maine ecosystem and supports a multi-billion-dollar fishery. Understanding how climate change may affect its distribution is critical for fisheries management and conservation planning.
This analysis was conducted for JP297Dj: Ocean Forecasting - AI, Ecology, and Data Justice
Colby College, January 2026